|
The Universal Coded Character Set (UCS), is a standard set of characters defined by the International Standard ISO/IEC 10646, ''Information technology — Universal Coded Character Set (UCS)'' (plus amendments to that standard), which is the basis of many character encodings. The UCS contains nearly 100,000 abstract characters, each identified by an unambiguous name and an integer number called its code point. Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in the UCS with unique code points. The inclusiveness of the UCS is continually improving as characters from previously unrepresented writing systems are added. Since 1991, the Unicode Consortium and the ISO have developed ''The Unicode Standard'' ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard. The UCS has over 1.1 million code points available for use, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2000 that all software sold in its jurisdiction would have to support GB 18030. This required software intended for sale in the PRC to move beyond the BMP. The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms. ==Encoding forms== ISO 10646 defines several character ''encoding forms'' for the Universal Coded Character Set. The simplest, UCS-2,〔see UTF-16 for a more detailed discussion of UCS-2〕 uses a single ''code value'' (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP. Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16. The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates". Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2. Currently, the dominant UCS encoding is UTF-8, which is a variable-width encoding designed for backward compatibility with ASCII, and for avoiding the complications of endianness and byte-order marks in UTF-16 and UTF-32. More than half of all Web pages are encoded in UTF-8. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8. It is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications. See also Comparison of Unicode encodings. 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Universal Coded Character Set」の詳細全文を読む スポンサード リンク
|